Conversation

@xxmplus (Contributor) commented Oct 1, 2025

Log the full path to the WAL file in the disk-slow info so it's explicit whether the WAL file is on the primary or the secondary.

Enhance the log writer to instrument, in the latency histogram, all filesystem operations performed while writing the log file (create, write, sync, etc.).
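
For illustration only, a minimal sketch of the instrumentation pattern; the type and field names below are hypothetical and are not the code in this PR:

package walsketch

import (
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// instrumentedWALFile wraps a WAL file and records the latency of every
// filesystem operation against it in a single histogram.
type instrumentedWALFile struct {
	f    *os.File
	hist prometheus.Histogram // observed in nanoseconds in this sketch
}

func (w *instrumentedWALFile) Write(p []byte) (int, error) {
	start := time.Now()
	n, err := w.f.Write(p)
	w.hist.Observe(float64(time.Since(start).Nanoseconds()))
	return n, err
}

func (w *instrumentedWALFile) Sync() error {
	start := time.Now()
	err := w.f.Sync()
	w.hist.Observe(float64(time.Since(start).Nanoseconds()))
	return err
}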

Fixes #5328

@xxmplus xxmplus requested a review from a team as a code owner October 1, 2025 03:28
@xxmplus xxmplus requested a review from annrpom October 1, 2025 03:28
@cockroach-teamcity (Member)

This change is Reviewable

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch 2 times, most recently from e1d6b17 to 4c5809a Compare October 1, 2025 03:35
@RaduBerinde (Member) left a comment

Reviewable status: 0 of 19 files reviewed, 2 unresolved discussions (waiting on @annrpom)


record/log_writer_latency_test.go line 17 at r2 (raw file):

)

func TestWALMetricsWithErrorfsLatency(t *testing.T) {

Is this test reliable? In general, relying on timings leads to flakes. Can you run it with --exec 'stress -p 64' and with --exec 'stress -p 64' --race?

More generally, this is a LOT of testing code. ~1500 lines of code for a relatively simple addition - these tests will need to be maintained, sometimes debugged. I don't think it's worth it. Some things are hard to add unit tests for and that's ok, especially when it's an observability thing that has no bearing on correctness.


record/log_writer.go line 837 at r2 (raw file):

		writeStart := crtime.NowMono()
		_, err = w.w.Write(data)
		if writeLatency := writeStart.Elapsed(); w.flusher.walFileMetrics.WriteLatency != nil {

[nit] the assignment here is strange (it's not used in the condition). I'd just inline writeStart.Elapsed().
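
i.e., something like this, assuming the elided if-body observes the write latency in nanoseconds (that part isn't shown in the snippet above):

		writeStart := crtime.NowMono()
		_, err = w.w.Write(data)
		if w.flusher.walFileMetrics.WriteLatency != nil {
			w.flusher.walFileMetrics.WriteLatency.Observe(float64(writeStart.Elapsed()))
		}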

@jbowens (Collaborator) left a comment

I haven't reviewed the test code yet, but in a cursory glance I found the sheer volume a little daunting. We try to hold our test code to a similar standard of simplicity as production code, often uncoupling test data/cases from test code that performs the mechanics. I'm wondering if we can test these changes with less code, perhaps leaning on either datadriven tests (uncoupling data and test code) or randomized testing (which tends to achieve higher coverage with less test code to maintain).
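
For example, a datadriven test could look roughly like this (a sketch only; the command name and testdata path are hypothetical):

package record

import (
	"testing"

	"github.com/cockroachdb/datadriven"
)

// Hypothetical shape of a datadriven test for the WAL file metrics: the
// expected output lives in the testdata file instead of Go assertions.
func TestWALFileMetricsDataDriven(t *testing.T) {
	datadriven.RunTest(t, "testdata/wal_file_metrics",
		func(t *testing.T, td *datadriven.TestData) string {
			switch td.Cmd {
			case "write":
				// A real implementation would feed td.Input to a LogWriter
				// backed by an in-memory FS and print the resulting
				// histogram counts.
				return "sample-count: 0\n"
			default:
				td.Fatalf(t, "unknown command %q", td.Cmd)
				return ""
			}
		})
}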

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @annrpom and @xxmplus)


metrics.go line 568 at r2 (raw file):

		CloseLatency   prometheus.Histogram // File close operations
		StatLatency    prometheus.Histogram // File stat operations
		OpenDirLatency prometheus.Histogram // Directory open operations

I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.
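
Roughly (a sketch; names hypothetical):

package walsketch

import "github.com/prometheus/client_golang/prometheus"

// One histogram per WAL directory rather than one per operation type:
// every create/write/sync/close against a WAL file observes the histogram
// of the directory that file lives on.
type WALFileOpMetrics struct {
	Primary   prometheus.Histogram
	Secondary prometheus.Histogram
}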

@xxmplus (Contributor, Author) left a comment

will do.

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)


metrics.go line 568 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.

These ops have quite different latency and access-frequency patterns. If we consolidate them into one, we will lose visibility into the less frequent ops. Is that still OK?


record/log_writer_latency_test.go line 17 at r2 (raw file):

Previously, RaduBerinde wrote…

Is this test reliable? In general, relying on timings leads to flakes. Can you run it with --exec 'stress -p 64' and with --exec 'stress -p 64' --race?

More generally, this is a LOT of testing code. ~1500 lines of code for a relatively simple addition - these tests will need to be maintained, sometimes debugged. I don't think it's worth it. Some things are hard to add unit tests for and that's ok, especially when it's an observability thing that has no bearing on correctness.

I ran it under stress but not under race. OK, I will remove some of the more verbose tests.

@xxmplus (Contributor, Author) commented Oct 8, 2025

I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.

These ops have quite different latency and access-frequency patterns. If we consolidate them into one, we will lose visibility into the less frequent ops. Is that still OK?

I had a chance to sync up with Jackson. While these ops have different latency profiles, we care more about their impact as a whole. I will update the code to consolidate them and avoid increasing the number of metrics.

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch 3 times, most recently from 2350d6b to 2d250f8 Compare October 14, 2025 02:53
Log full path to WAL file in disk slow info so it's explicit if the
WAL file is on the primary or secondary.

Fixes cockroachdb#5328
@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 2d250f8 to 3d8fdf3 Compare October 22, 2025 19:45
@jbowens (Collaborator) left a comment

@jbowens reviewed 1 of 2 files at r1, 1 of 17 files at r4, all commit messages.
Reviewable status: 2 of 19 files reviewed, 7 unresolved discussions (waiting on @annrpom, @RaduBerinde, and @xxmplus)


open.go line 295 at r4 (raw file):

	walFileOpHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "pebble_wal_file_op_duration_nanoseconds",
		Help:    "Histogram of WAL file operation latencies in nanoseconds",

nit: I think we should just exclude the Name and Help sections; they'll be unused in the context of Cockroach and may confuse.


metrics.go line 578 at r4 (raw file):

		prometheus.LinearBuckets(0.0, 0.01, 50),                 // 0 to 0.5ms in 10μs increments
		prometheus.ExponentialBucketsRange(1.0, 10000.0, 50)..., // 1ms to 10s
	)

I think we can unify the histogram configuration, consolidating FsyncLatencyBuckets and WALFileOpLatencyBuckets
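
Something like this (a sketch reusing the bucket values from the snippet above; the variable and constructor names are hypothetical, and per the nit above, Name/Help are omitted):

package walsketch

import "github.com/prometheus/client_golang/prometheus"

// A single bucket layout shared by the fsync histogram and the WAL
// file-op histogram.
var WALLatencyBuckets = append(
	prometheus.LinearBuckets(0.0, 0.01, 50),                 // 0 to 0.5ms in 10μs increments
	prometheus.ExponentialBucketsRange(1.0, 10000.0, 50)..., // 1ms to 10s
)

func newWALLatencyHistogram() prometheus.Histogram {
	return prometheus.NewHistogram(prometheus.HistogramOpts{
		Buckets: WALLatencyBuckets,
	})
}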


db.go line 383 at r4 (raw file):

				// protected by mu.
				fsyncLatency prometheus.Histogram
				// Updated whenever a wal.Writer is closed.

nit: can we keep the comment that metrics.LogWriterMetrics is updated when a WAL writer is closed?


wal/failover_writer.go line 463 at r4 (raw file):

	// WAL file operation latency histogram
	walFileOpHistogram record.WALFileOpHistogram
	directoryType      record.DirectoryType

The failover writer handles writing to both the primary and the secondary. I think we need access to both histograms and the ability to construct child LogWriters that use one or the other depending on which device the child LogWriter is writing to.
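
Roughly (a sketch; names hypothetical): the failover writer would hold one histogram per directory and hand the matching one to each child LogWriter it constructs.

package walsketch

import "github.com/prometheus/client_golang/prometheus"

// One histogram per WAL directory; the failover writer picks the right
// one when it constructs a child LogWriter.
type failoverFileOpMetrics struct {
	primary   prometheus.Histogram
	secondary prometheus.Histogram
}

func (m *failoverFileOpMetrics) forDir(secondary bool) prometheus.Histogram {
	if secondary {
		return m.secondary
	}
	return m.primary
}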

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 3d8fdf3 to 63a4485 Compare October 30, 2025 23:11
@xxmplus (Contributor, Author) left a comment

Reviewable status: 2 of 19 files reviewed, 7 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)


db.go line 383 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: can we keep the comment that metrics.LogWriterMetrics is updated when a WAL writer is closed?

ack. added to WALMetrics definition in metrics.go


metrics.go line 578 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think we can unify the histogram configuration, consolidating FsyncLatencyBuckets and WALFileOpLatencyBuckets

ack. done.


open.go line 295 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I think we should just exclude the Name and Help sections; they'll be unused in the context of Cockroach and may confuse.

removed.


wal/failover_writer.go line 463 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

The failover writer handles writing to both the primary and the secondary. I think we need access to both histograms and the ability to construct child LogWriters that use one or the other depending on which device the child LogWriter is writing to.

ack.

@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 63a4485 to 84b5370 Compare October 30, 2025 23:38
@xxmplus xxmplus requested a review from jbowens October 30, 2025 23:55
@github-actions (bot) commented Nov 5, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@RaduBerinde (Member) commented

Cost: $0.0210 | Duration: 8.7s

Claude just gave new meaning to "my two cents"

Enhance the log writer to instrument all filesystem operations done
for writing the log file in the latency histogram (create, write, sync, etc.).
@xxmplus xxmplus force-pushed the i5328-improve-metrics branch from 84b5370 to 4f01c47 Compare November 5, 2025 19:58
@xxmplus (Contributor, Author) commented Nov 5, 2025

I believe the AI review bot found a real bug. The PR has been updated accordingly.

Development

Successfully merging this pull request may close these issues:

db: improve metrics and logs with multiple disks being used by a DB

4 participants